packages <- c("tidyverse","janitor")
sapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'janitor'
##
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## $tidyverse
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## $janitor
## [1] "janitor" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
setwd("~/Documents/han-lab/")
ZooScore Dataset
The ZooScore dataset compiles ZooScores determined for a variety of pathogens and parasites collected from the Global Mammal Parasite Database (GMPD). The image below shows the decision tree used to calculate a ZooScore, which ranges from -1 (a pathogen not found in humans) to 3 (a pathogen capable of human-to-human transmission, e.g., SARS-CoV-2).
How does condition 1 (a pathogen that can be acquired through a vertebrate reservoir) differ from condition 2 (a pathogen that is not transmitted to other humans) and condition 3 (a pathogen that is transmissible to other humans)? Does condition 1 imply that transmissibility to other humans has not yet been discovered?
Answer: (1) Not possible at the moment. MERS-CoV, Hendra virus.
The glimpse() function below reports the number of rows and columns, the names of the variables, and the data type of each variable in the data.
ZooScore Data
df_zs%>% glimpse()
## Rows: 2,008
## Columns: 28
## $ parasite_corrected_name <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ insect <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ genus_only <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ commensal <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ zoo_score <chr> "-1", "-1", "-1", "-1", "0", "-1", "2", "-1"…
## $ confidence_score <dbl> 3, 3, 1, 2, 1, 1, 1, 1, NA, 1, 1, 2, 2, 1, 2…
## $ xc_zoo_score <dbl> -1, -1, -1, -1, 0, -1, -1, -1, -1, 3, 3, 0, …
## $ xc_c_score <dbl> 1, 2, 2, 2, 2, 3, 2, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ xc_notes <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ xc_who_by <chr> "VR", "VR", "VR", "VR", "VR", "VR", "VR", "V…
## $ xc_date <dbl> 42765, 42765, 42765, 42765, 42765, 42765, 42…
## $ pgf_zoo_score <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_c_score <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ non_gmpd <chr> "0", "0", NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ search_string_goog <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ googlehits_as_of_2_8_2017 <dbl> 1410, 431, 217, 65, 713, 52, 404, 82, 64, 65…
## $ search_string_wos <chr> "Acanthocephalus anguillae", "Acanthocephalu…
## $ wo_shits_as_of_2_6_2017 <dbl> 57, 23, 13, 0, 4, 2, 39, 9, 1, 7371, 2464, 2…
## $ notes <chr> NA, NA, "H: dog", "H: primate", NA, "H: Racc…
## $ citation <chr> "Kennedy and Moriarty 1987", "Heckmann et al…
## $ print_ref <chr> NA, NA, NA, NA, NA, NA, "NEED", NA, NA, NA, …
## $ who_by <chr> "VR", "VR", "VR", "VR", "VR", "VR", "VR", "V…
## $ date_entry <dbl> 42765, 42765, 42765, 42765, 42765, 42765, 42…
## $ xc_citation <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_citation <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ pgf_more_citations <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ nematode <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,…
There are 28 columns and 2008 rows. Each column is a variable
related to the parasite and its ZooScore as assigned by investigators.
Each row represents one parasite. Since the variable
parasite_corrected_name serves as the index, the total
number of rows should match the number of distinct
parasite_corrected_name values. To verify this,
I counted the distinct values of
parasite_corrected_name.
df_zs %>%
distinct(parasite_corrected_name) %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 2007
Since the number of distinct parasite_corrected_name values (2007)
does not match the total number of rows (2008), we
should check whether there were any data entry issues.
df_zs[df_zs$parasite_corrected_name %in% names(which(table(df_zs$parasite_corrected_name) > 1)), ]
## # A tibble: 2 × 28
## parasite_corrected_name insect genus_only commensal zoo_score confidence_score
## <chr> <dbl> <dbl> <lgl> <chr> <dbl>
## 1 Ascaris suum NA 0 NA 2 2
## 2 Ascaris suum NA 0 NA 1 1
## # ℹ 22 more variables: xc_zoo_score <dbl>, xc_c_score <dbl>, xc_notes <lgl>,
## # xc_who_by <chr>, xc_date <dbl>, pgf_zoo_score <dbl>, pgf_c_score <dbl>,
## # pgf_notes <chr>, non_gmpd <chr>, search_string_goog <chr>,
## # googlehits_as_of_2_8_2017 <dbl>, search_string_wos <chr>,
## # wo_shits_as_of_2_6_2017 <dbl>, notes <chr>, citation <chr>,
## # print_ref <chr>, who_by <chr>, date_entry <dbl>, xc_citation <lgl>,
## # pgf_citation <chr>, pgf_more_citations <chr>, nematode <dbl>
One parasite (Ascaris suum) appears twice, but the two entries differ in their other variables (for example, zoo_score and confidence_score). This may require further investigation.
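The same duplicate check can be written with dplyr verbs. This is a minimal sketch on a small made-up tibble (toy_zs and its values are hypothetical stand-ins for df_zs), which keeps every row whose name occurs more than once.

```r
library(dplyr)

# Hypothetical toy data standing in for df_zs (values are made up)
toy_zs <- tibble::tibble(
  parasite_corrected_name = c("Ascaris suum", "Ascaris suum", "Acanthocephalus ranae"),
  zoo_score = c("2", "1", "-1")
)

# Count how often each name occurs, then keep only the repeated ones
dupes <- toy_zs %>%
  add_count(parasite_corrected_name, name = "n_rows") %>%
  filter(n_rows > 1)

print(dupes)
```

Since janitor is already loaded, janitor::get_dupes(df_zs, parasite_corrected_name) should give a comparable result on the real data.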
purrr::map_dbl(df_zs, ~sum(is.na(.)))
## parasite_corrected_name insect genus_only
## 0 2004 1
## commensal zoo_score confidence_score
## 2008 837 877
## xc_zoo_score xc_c_score xc_notes
## 0 1 2008
## xc_who_by xc_date pgf_zoo_score
## 0 0 1888
## pgf_c_score pgf_notes non_gmpd
## 1887 1874 930
## search_string_goog googlehits_as_of_2_8_2017 search_string_wos
## 0 0 0
## wo_shits_as_of_2_6_2017 notes citation
## 0 1059 2
## print_ref who_by date_entry
## 1760 1 0
## xc_citation pgf_citation pgf_more_citations
## 2008 1895 1931
## nematode
## 1767
df_zs %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "column", values_to = "count") %>%
ggplot(aes(x = column, y = count)) +
geom_bar(stat = "identity", fill = "#D31245", width = 0.5) +
geom_text(aes(label = count), vjust = -0.5, color = "black", size = 2.5) +
xlab("Column") +
ylab("Missing Value Count") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Several variables have a large share of missing values, in particular
insect, commensal, xc_notes,
pgf_zoo_score, pgf_c_score,
pgf_notes, notes, print_ref,
xc_citation, pgf_citation,
pgf_more_citations, and nematode.
Is there another dataset that classifies parasites by family or by position in a phylogenetic tree? If so, it would help fill in missing information about parasite features such as
insect, commensal, and nematode.
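To make "a large share of missing values" concrete, the raw counts can be turned into proportions and compared against a cutoff. This is a minimal sketch on made-up data; toy_zs, its values, and the 0.5 threshold are all assumptions chosen for illustration, not taken from the analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for df_zs (values are made up)
toy_zs <- tibble::tibble(
  insect   = c(1, NA, NA, NA),
  nematode = c(0, 1, NA, NA),
  who_by   = c("VR", "VR", "VR", NA)
)

na_threshold <- 0.5  # assumed cutoff, chosen only for illustration

# Share of missing values per column, largest first
na_table <- toy_zs %>%
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "share_missing") %>%
  arrange(desc(share_missing))

# Columns at or above the cutoff
mostly_missing <- na_table %>%
  filter(share_missing >= na_threshold)

print(mostly_missing)
```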
df_zs %>%
summarise(across(everything(), n_distinct)) %>%
pivot_longer(everything(), names_to = "column", values_to = "count") %>%
ggplot(aes(x = column, y = count)) +
geom_bar(stat = "identity", fill = "#D31245", width = 0.5) +
geom_text(aes(label = count), vjust = -0.5, color = "black", size = 2.5) +
xlab("Column") +
ylab("Distinct Count") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
purrr::map(df_zs, n_distinct)
## $parasite_corrected_name
## [1] 2007
##
## $insect
## [1] 2
##
## $genus_only
## [1] 4
##
## $commensal
## [1] 1
##
## $zoo_score
## [1] 17
##
## $confidence_score
## [1] 5
##
## $xc_zoo_score
## [1] 6
##
## $xc_c_score
## [1] 4
##
## $xc_notes
## [1] 1
##
## $xc_who_by
## [1] 1
##
## $xc_date
## [1] 94
##
## $pgf_zoo_score
## [1] 5
##
## $pgf_c_score
## [1] 4
##
## $pgf_notes
## [1] 132
##
## $non_gmpd
## [1] 4
##
## $search_string_goog
## [1] 2007
##
## $googlehits_as_of_2_8_2017
## [1] 939
##
## $search_string_wos
## [1] 2007
##
## $wo_shits_as_of_2_6_2017
## [1] 442
##
## $notes
## [1] 771
##
## $citation
## [1] 1517
##
## $print_ref
## [1] 9
##
## $who_by
## [1] 3
##
## $date_entry
## [1] 93
##
## $xc_citation
## [1] 1
##
## $pgf_citation
## [1] 109
##
## $pgf_more_citations
## [1] 74
##
## $nematode
## [1] 3
[parasite_corrected_name] There are 2008 entries in parasite_corrected_name, of which 2007 are distinct.
As mentioned earlier, each value should identify one row in this data;
Ascaris suum is the one duplicate.
df_zs%>%
select(parasite_corrected_name)%>%
mutate(parasite_corrected_name = as.factor(parasite_corrected_name))%>%
summary()
## parasite_corrected_name
## Ascaris suum : 2
## Acanthocephalus anguillae : 1
## Acanthocephalus ranae : 1
## Acanthocheilonema dracunculoides: 1
## Acanthocheilonema gracile : 1
## Acanthocheilonema perstans : 1
## (Other) :2001
[insect] Only four entries are non-missing, and all four have the value 1 (insect).
df_zs%>%
select(insect)%>%
mutate(insect = as.factor(insect))%>%
summary()
## insect
## 1 : 4
## NA's:2004
df_zs%>%
mutate(insect= as.factor(insect))%>%
ggplot(mapping=aes(x=insect))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
[genus_only] This variable answers the question: does the pathogen/parasite entry represent the entire genus?
df_zs%>%
select(genus_only)%>%
mutate(genus_only = as.factor(genus_only))%>%
summary()
## genus_only
## 0 :1853
## 1 : 21
## 3 : 133
## NA's: 1
df_zs%>%
mutate(genus_only= as.factor(genus_only))%>%
ggplot(mapping=aes(x=genus_only))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
What does each value mean?
- 0: Is the pathogen/parasite representing the entire genus?
- 1: Is the pathogen/parasite representing the entire genus?
- 3: Is the pathogen/parasite representing the entire genus?
[commensal] This variable answers the question: does the pathogen benefit from its host without harming it? However, all 2008 values are missing.
df_zs%>%
select(commensal)%>%
mutate(commensal = as.factor(commensal))%>%
summary()
## commensal
## NA's:2008
[zoo_score] The score assigned to the pathogen should range from -1 to 3, as stated in the documentation. However, 60 values fall outside this range (the column is stored as character and has 17 distinct values, counting NA), in addition to 837 missing (NA) values.
df_zs%>%
select(zoo_score)%>%
mutate(zoo_score = as.factor(zoo_score))%>%
summary()
## zoo_score
## -1 :523
## 0 :316
## 1 :165
## 2 : 79
## 3 : 28
## (Other): 60
## NA's :837
df_zs%>%
mutate(zoo_score= as.factor(zoo_score))%>%
mutate(zoo_score = fct_lump(zoo_score, n = 5, other_level = "(other)")) %>%
ggplot(mapping=aes(x=zoo_score))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
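The summary above lumps the out-of-range entries into "(Other)". To see what they actually are, one option is to filter on the documented set of scores. This is a minimal sketch on made-up data: toy_zs and the malformed strings in it are hypothetical, since the real out-of-range values are not shown in this report.

```r
library(dplyr)

# Hypothetical toy data; the malformed entries are invented for illustration
toy_zs <- tibble::tibble(
  zoo_score = c("-1", "0", "3", "1?", "2 or 3", NA)
)

# The documented range of ZooScores, as character to match the column type
valid_scores <- as.character(-1:3)

# Non-missing entries that are not one of the documented scores
unexpected <- toy_zs %>%
  filter(!is.na(zoo_score), !zoo_score %in% valid_scores) %>%
  count(zoo_score, sort = TRUE)

print(unexpected)
```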
[confidence_score] This score represents the confidence level in the ZooScore, with 1 indicating high confidence and 3 indicating low or no confidence. There are 877 NA values, which should be investigated to determine the underlying reasons. Additionally, one data point, 33, falls outside the range; it is likely a typo for 3.
df_zs%>%
select(confidence_score)%>%
mutate(confidence_score = as.factor(confidence_score))%>%
summary()
## confidence_score
## 1 :367
## 2 :399
## 3 :364
## 33 : 1
## NA's:877
df_zs%>%
mutate(confidence_score= as.factor(confidence_score))%>%
ggplot(mapping=aes(x=confidence_score))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
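If the value 33 is confirmed to be a typo for 3, it could be flagged and corrected in a separate column so the raw value stays available. This is only a sketch under that assumption; toy_zs and the new column names are hypothetical, and the correction should not be applied to the real data without checking the source record.

```r
library(dplyr)

# Hypothetical toy data standing in for df_zs$confidence_score
toy_zs <- tibble::tibble(confidence_score = c(1, 2, 3, 33, NA))

cleaned <- toy_zs %>%
  mutate(
    # TRUE for non-missing values outside the documented 1 to 3 range
    confidence_flag = !is.na(confidence_score) & !confidence_score %in% 1:3,
    # Assumption: 33 was meant to be 3; missing values stay missing
    confidence_score_clean = if_else(confidence_score == 33, 3, confidence_score)
  )

print(cleaned)
```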
[xc_zoo_score] The xc_zoo_score variable is the cross-checked ZooScore after review by multiple individuals. It has the same range as the regular ZooScore. It is more complete than zoo_score, with no missing (NA) values. One data point (-2) falls outside the expected range.
df_zs%>%
select(xc_zoo_score)%>%
mutate(xc_zoo_score = as.factor(xc_zoo_score))%>%
summary()
## xc_zoo_score
## -2: 1
## -1:1402
## 0 : 335
## 1 : 136
## 2 : 70
## 3 : 64
df_zs%>%
mutate(xc_zoo_score= as.factor(xc_zoo_score))%>%
ggplot(mapping=aes(x=xc_zoo_score))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
[xc_c_score] The xc_c_score variable is the cross-checked confidence score after review by multiple individuals. It has the same range as the regular confidence score. It is more complete than confidence_score, with fewer missing (NA) values (1 versus 877). All non-missing data points are within the expected range.
Are xc_zoo_score and xc_c_score considered the final ZooScore and confidence score after cross-checking by multiple individuals?
df_zs%>%
select(xc_c_score)%>%
mutate(xc_c_score = as.factor(xc_c_score))%>%
summary()
## xc_c_score
## 1 :693
## 2 :433
## 3 :881
## NA's: 1
df_zs%>%
mutate(xc_c_score= as.factor(xc_c_score))%>%
ggplot(mapping=aes(x=xc_c_score))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
[xc_notes] There are no xc_notes entries; all 2008 values are missing.
df_zs%>%
select(xc_notes)%>%
mutate(xc_notes = as.factor(xc_notes))%>%
summary()
## xc_notes
## NA's:2008
[xc_who_by] The xc_who_by variable records the investigator who performed the cross-checking. There is only one unique value, ‘VR’.
df_zs%>%
select(xc_who_by)%>%
mutate(xc_who_by = as.factor(xc_who_by))%>%
summary()
## xc_who_by
## VR:2008
[xc_date] The xc_date variable represents the date when the cross-checking was performed. The raw values are spreadsheet serial numbers (e.g., 42765), which are not meaningful as they stand, so they are converted to Date objects (displayed as “%Y-%m-%d”) using the origin 1899-12-30. The cross-checking process started on 2016-07-11 and its last entry is dated 2017-01-30.
df_zs$xc_date <- as.Date(df_zs$xc_date, origin = "1899-12-30")
df_zs%>%
select(xc_date)%>%
summary()
## xc_date
## Min. :2016-07-11
## 1st Qu.:2016-10-25
## Median :2016-11-17
## Mean :2016-11-15
## 3rd Qu.:2016-12-15
## Max. :2017-01-30
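As a quick sanity check of the conversion, the raw value 42765 shown by glimpse() earlier can be converted on its own. The sketch below uses only base R; the variable names are illustrative.

```r
# 42765 is the raw xc_date value shown in the glimpse() output above
serial <- 42765

# Spreadsheet serial dates count days from 1899-12-30
converted <- as.Date(serial, origin = "1899-12-30")

format(converted, "%Y-%m-%d")
# → "2017-01-30"
```

Since janitor is already loaded, janitor::excel_numeric_to_date() is an alternative that should encode the same origin, assuming the file came from a spreadsheet using the modern (1900-based) date system.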
df_zs %>%
mutate(the_year = lubridate::year(xc_date),
the_month = lubridate::month(xc_date)) %>%
ggplot(mapping = aes(x = the_year, y = 1)) +
geom_jitter(width = 0.4, height = 0.07, alpha = 0.2, color = '#D31245') +
theme_bw() +
theme(panel.grid.minor.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
[pgf variables] These variables (pgf_zoo_score,
pgf_c_score, pgf_notes,
pgf_citation, pgf_more_citations) were provided
by Pasha, a former lab member. However, it is unclear how meaningful
they are in the context of this analysis. Further investigation is needed
to determine their significance.
df_zs%>%
select(pgf_zoo_score)%>%
mutate(pgf_zoo_score = as.factor(pgf_zoo_score))%>%
summary()
## pgf_zoo_score
## -1 : 9
## 0 : 41
## 2 : 56
## 3 : 14
## NA's:1888
df_zs%>%
select(pgf_c_score)%>%
mutate(pgf_c_score = as.factor(pgf_c_score))%>%
summary()
## pgf_c_score
## 1 : 41
## 2 : 41
## 3 : 39
## NA's:1887
df_zs%>%
select(pgf_citation)%>%
mutate(pgf_citation = as.factor(pgf_citation))%>%
summary()
## pgf_citation
## Acha : 5
## Falkinham III 1996 : 2
## Acha et al. vol II pg. 229: 1
## Acha Vol I : 1
## Albert & Stevens 2010 : 1
## (Other) : 103
## NA's :1895
df_zs%>%
select(pgf_more_citations)%>%
mutate(pgf_more_citations = as.factor(pgf_more_citations))%>%
summary()
## pgf_more_citations
## Abrahamian and Goldstein 2011 : 3
## Acha : 3
## Acha vol 1, pg 199 : 1
## Acha vol 3, pg. 64 & Coatney 1971: 1
## Acha, vol 3, pg 63+ & Baird 2009 : 1
## (Other) : 68
## NA's :1931
[non_gmpd] The non_gmpd variable indicates whether a pathogen is
not present in GMPD (Global Mammal Parasite Database). In the
dataset, 30 data points are categorized as not sourced
from GMPD, and 930 values are missing. One entry contains stray text
("Meningonema peruzzii transmission") instead of 0 or 1.
df_zs%>%
select(non_gmpd)%>%
mutate(non_gmpd = as.factor(non_gmpd))%>%
summary()
## non_gmpd
## 0 :1047
## 1 : 30
## Meningonema peruzzii transmission: 1
## NA's : 930
df_zs%>%
mutate(non_gmpd= as.factor(non_gmpd))%>%
ggplot(mapping=aes(x=non_gmpd))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()
[search_string_goog] The search_string_goog variable holds the exact
search string used for the Google Scholar search. The values appear to be
identical to those in the parasite_corrected_name
variable.
df_zs%>%
select(search_string_goog)%>%
mutate(search_string_goog = as.factor(search_string_goog))%>%
summary()
## search_string_goog
## Ascaris suum : 2
## Acanthocephalus anguillae : 1
## Acanthocephalus ranae : 1
## Acanthocheilonema dracunculoides: 1
## Acanthocheilonema gracile : 1
## Acanthocheilonema perstans : 1
## (Other) :2001
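The summary only suggests that the search strings match the parasite names. A direct row-by-row comparison would confirm it. This is a minimal sketch on made-up data (toy_zs is a hypothetical stand-in for df_zs); zero mismatching rows would mean the two columns are identical.

```r
library(dplyr)

# Hypothetical toy data standing in for df_zs
toy_zs <- tibble::tibble(
  parasite_corrected_name = c("Ascaris suum", "Acanthocephalus ranae"),
  search_string_goog      = c("Ascaris suum", "Acanthocephalus ranae")
)

# Rows where the search string differs from the parasite name
mismatches <- toy_zs %>%
  filter(search_string_goog != parasite_corrected_name)

nrow(mismatches)
```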
[googlehits] The googlehits_as_of_2_8_2017 variable represents how many hits
were found in Google Scholar. Its distribution is strongly right-skewed:
the median is 214.5, while the mean is 12542.1. The very large maximum
(2,650,000) appears to pull the mean up substantially.
df_zs%>%
select(googlehits_as_of_2_8_2017)%>%
summary()
## googlehits_as_of_2_8_2017
## Min. : 0.0
## 1st Qu.: 35.0
## Median : 214.5
## Mean : 12542.1
## 3rd Qu.: 2122.5
## Max. :2650000.0
df_zs %>%
ggplot(mapping = aes(x = googlehits_as_of_2_8_2017)) +
geom_rug(linewidth = 1) +
stat_ecdf(linewidth = 1.2) +
theme_bw()
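With a distribution this skewed, summaries on a log scale are often easier to read. This is a minimal sketch on made-up hit counts: toy_hits is hypothetical, and adding 1 before taking the logarithm is an assumption made here so that zero hits stay defined.

```r
library(dplyr)

# Hypothetical hit counts spanning several orders of magnitude
toy_hits <- tibble::tibble(hits = c(0, 35, 214, 2122, 2650000))

hit_summary <- toy_hits %>%
  summarise(
    mean_raw     = mean(hits),
    median_raw   = median(hits),
    # log10(hits + 1) keeps zero counts finite
    median_log10 = median(log10(hits + 1))
  )

print(hit_summary)
```

The same transformation could be applied to the x aesthetic of the ECDF plot, so that the bulk of the distribution is not compressed against the left edge.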
[search_string_wos] The search_string_wos variable holds the exact
search string used for the Web of Science search. The values appear to be
identical to those in the parasite_corrected_name
variable.
df_zs%>%
select(search_string_wos)%>%
mutate(search_string_wos = as.factor(search_string_wos))%>%
summary()
## search_string_wos
## Ascaris suum : 2
## Acanthocephalus anguillae : 1
## Acanthocephalus ranae : 1
## Acanthocheilonema dracunculoides: 1
## Acanthocheilonema gracile : 1
## Acanthocheilonema perstans : 1
## (Other) :2001
[wo_shits] The wo_shits_as_of_2_6_2017 variable represents how many hits were found in Web of Science. Its distribution is similar to that of the Google Scholar hits: strongly right-skewed (median 5, mean 708.3, maximum 391,855).
df_zs%>%
select(wo_shits_as_of_2_6_2017)%>%
summary()
## wo_shits_as_of_2_6_2017
## Min. : 0.0
## 1st Qu.: 1.0
## Median : 5.0
## Mean : 708.3
## 3rd Qu.: 68.0
## Max. :391855.0
df_zs %>%
ggplot(mapping = aes(x = wo_shits_as_of_2_6_2017)) +
geom_rug(linewidth = 1) +
stat_ecdf(linewidth = 1.2) +
theme_bw()
[notes] This variable holds free-text notes for each record.
df_zs%>%
select(notes)%>%
mutate(notes = as.factor(notes))%>%
summary()
## notes
## equid : 19
## Nematoda : 13
## Nematode : 12
## Bacterium : 10
## tick vector: 10
## (Other) : 885
## NA's :1059
[citation] This variable holds the citations used to support the ZooScore. The folder where these citations are stored can be shared. It would also be interesting to explore, based on the citations, whether particular investigators are associated with certain pathogens or parasites.
df_zs%>%
select(citation)%>%
mutate(citation = as.factor(citation))%>%
summary()
## citation
## Gideon : 87
## NEED : 51
## Stuart et al., 1998 : 18
## Irwin and Raharison 2009: 13
## Scialdo-Krecek 1983 : 12
## (Other) :1825
## NA's : 2
[print_ref] This variable answers the question, “Do we need a print version of the reference, or does the reference exist in print form only?” However, the values recorded for this variable (e.g., Gideon, NEED, yes, GET, GS) need clarification.
df_zs%>%
select(print_ref)%>%
mutate(print_ref = as.factor(print_ref))%>%
summary()
## print_ref
## Gideon : 164
## NEED : 77
## yes : 2
## GET : 1
## GS : 1
## (Other): 3
## NA's :1760
df_zs%>%
mutate(print_ref= as.factor(print_ref))%>%
ggplot(mapping=aes(x=print_ref))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
[who_by] This variable records the investigator who assigned the data points. Nearly all data points (2006) were assigned by one investigator, VR; one entry is spelled ‘Vr’, which is likely the same person, and one is missing.
df_zs%>%
select(who_by)%>%
mutate(who_by = as.factor(who_by))%>%
summary()
## who_by
## Vr : 1
## VR :2006
## NA's: 1
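If ‘Vr’ and ‘VR’ do refer to the same investigator, normalizing the case would merge the two levels. This is a minimal sketch under that assumption, using a made-up vector (toy_who is hypothetical) and stringr, which is attached with the tidyverse.

```r
library(stringr)

# Hypothetical initials standing in for df_zs$who_by
toy_who <- c("VR", "Vr", "VR", NA)

# Upper-case every entry; missing values stay missing
who_clean <- str_to_upper(toy_who)

table(who_clean, useNA = "ifany")
```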
[date_entry] The date_entry variable represents the date when the ZooScore was assigned. The raw values are spreadsheet serial numbers (e.g., 42765), which are not meaningful as they stand, so they are converted to Date objects (displayed as “%Y-%m-%d”) using the origin 1899-12-30. Data entry started on 2016-01-08, and the last entry is dated 2017-01-30.
df_zs$date_entry <- as.Date(df_zs$date_entry, origin = "1899-12-30")
df_zs%>%
select(date_entry)%>%
summary()
## date_entry
## Min. :2016-01-08
## 1st Qu.:2016-10-25
## Median :2016-11-17
## Mean :2016-11-14
## 3rd Qu.:2016-12-15
## Max. :2017-01-30
df_zs %>%
mutate(the_year = lubridate::year(date_entry),
the_month = lubridate::month(date_entry)) %>%
ggplot(mapping = aes(x = the_year, y = 1)) +
geom_jitter(width = 0.4, height = 0.07, alpha = 0.2, color = '#D31245') +
theme_bw() +
theme(panel.grid.minor.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
[xc_citation] No data points available.
df_zs%>%
select(xc_citation)%>%
mutate(xc_citation = as.factor(xc_citation))%>%
summary()
## xc_citation
## NA's:2008
[nematode] This variable indicates whether the pathogen/parasite is a nematode (roundworm). Most values (1767) are missing.
df_zs%>%
select(nematode)%>%
mutate(nematode = as.factor(nematode))%>%
summary()
## nematode
## 0 : 151
## 1 : 90
## NA's:1767
df_zs%>%
mutate(nematode= as.factor(nematode))%>%
ggplot(mapping=aes(x=nematode))+
geom_bar()+
geom_label(stat='count',
mapping =aes(label = after_stat(count)),
color = '#D31245',size = 4, vjust= 0.3 )+
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
# Save df_zs to a CSV file
write.csv(df_zs, "df_zs.csv", row.names = FALSE)
xc_zoo_score
xc_zoo_score and genus
The meaning of the values in the genus_only variable is currently unknown: the observed values (0, 1, and 3) do not have a defined interpretation at this point. However, based on observations, zoonotic scores appear mainly when the genus_only value is either 0 or 3.
df_zs %>%
ggplot(mapping = aes(x =xc_zoo_score))+
geom_bar(mapping = aes(fill = as.factor(xc_zoo_score)))+
facet_wrap(~genus_only)+
ggthemes::scale_fill_colorblind('xc_zoo_score')+
theme_bw()
xc_zoo_score and nematode
The high number of missing values limits the interpretability and significance of any ZooScore pattern associated with the nematode variable.
df_zs %>%
ggplot(mapping = aes(x =xc_zoo_score))+
geom_bar(mapping = aes(fill = as.factor(xc_zoo_score)))+
facet_wrap(~nematode)+
ggthemes::scale_fill_colorblind('xc_zoo_score')+
theme_bw()
xc_zoo_score and xc_c_score
df_zs %>%
filter(!is.na(xc_zoo_score) & !is.na(xc_c_score))%>%
count(xc_zoo_score, xc_c_score) %>%
ggplot(mapping = aes(x=xc_zoo_score, y=xc_c_score))+
geom_tile(mapping = aes(fill = n),
color = 'black')+
geom_label(mapping = aes(label = n,
color = n > median(n)),
size = 2.5)+
scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
"FALSE" = '#091F40'))+
scale_fill_continuous()+
theme_bw()+
theme(axis.text.x = element_text(angle = 20, hjust=1))
library(corrr)
The correlation between Google Scholar hits and Web of Science hits is strong.
library(corrplot)
## corrplot 0.92 loaded
cor_matrix <- df_zs[c('googlehits_as_of_2_8_2017', 'wo_shits_as_of_2_6_2017', 'xc_zoo_score', 'xc_c_score')] %>%
filter(!is.na(googlehits_as_of_2_8_2017) & !is.na(wo_shits_as_of_2_6_2017)) %>%
cor(use = "pairwise.complete.obs")
corrplot(cor_matrix, type = 'upper', method = 'square',
order = 'hclust', hclust.method = 'ward.D2')
df_zs[c('googlehits_as_of_2_8_2017', 'wo_shits_as_of_2_6_2017', 'xc_zoo_score', 'xc_c_score')] %>%
filter(!is.na(googlehits_as_of_2_8_2017) & !is.na(wo_shits_as_of_2_6_2017)) %>%
correlate(diagonal = 1, quiet = TRUE) %>%
stretch() %>%
ggplot(mapping = aes(x = x, y = y)) +
geom_tile(mapping = aes(fill = r),
color = 'black') +
geom_text(mapping = aes(label = round(r, 2)),
size = 6) +
coord_equal() +
scale_fill_gradient2(low = 'red', mid = 'white', high = 'blue',
midpoint = 0,
limits = c(-1, 1)) +
labs(x = '', y = '') +
theme_bw()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
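Because both hit counts are heavily right-skewed, the Pearson correlation shown above can be dominated by a few very large values. A rank-based (Spearman) correlation is one possible cross-check. This is a minimal sketch on made-up numbers: toy_hits and its values are hypothetical, and this cross-check is a suggestion rather than part of the original analysis.

```r
# Hypothetical hit counts with one extreme pair, mimicking the skew
toy_hits <- data.frame(
  google_hits = c(10, 50, 200, 3000, 2650000),
  wos_hits    = c(0, 2, 5, 70, 391855)
)

# Pearson uses the raw values, so the extreme pair has a large influence
pearson_r <- cor(toy_hits$google_hits, toy_hits$wos_hits, method = "pearson")

# Spearman uses ranks only, so it reflects the monotone relationship
spearman_r <- cor(toy_hits$google_hits, toy_hits$wos_hits, method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

On the real data, cor() accepts the same method argument together with use = "pairwise.complete.obs", as in the call above.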
Comparison with the df_sub data
df_sub%>% glimpse()
## Rows: 4,350
## Columns: 89
## $ pathogen <chr> "Acanthocephalus anguill…
## $ insect <dbl> NA, NA, NA, NA, NA, NA, …
## $ genus_only <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ commensal <lgl> NA, NA, NA, NA, NA, NA, …
## $ zoo_score <chr> "-1", "-1", "-1", "-1", …
## $ confidence_score <dbl> 3, 3, 1, 2, 1, 1, 1, 1, …
## $ xc_zoo_score <dbl> -1, -1, -1, -1, 0, -1, -…
## $ xc_c_score <dbl> 1, 2, 2, 2, 2, 3, 2, 3, …
## $ xc_notes <lgl> NA, NA, NA, NA, NA, NA, …
## $ xc_who_by <chr> "VR", "VR", "VR", "VR", …
## $ xc_date <date> 2017-01-30, 2017-01-30,…
## $ pgf_zoo_score <dbl> NA, NA, NA, NA, NA, NA, …
## $ pgf_c_score <dbl> NA, NA, NA, NA, NA, NA, …
## $ pgf_notes <chr> NA, NA, NA, NA, NA, NA, …
## $ non_gmpd <chr> "0", "0", NA, NA, NA, NA…
## $ search_string_goog <chr> "Acanthocephalus anguill…
## $ googlehits_as_of_2_8_2017 <dbl> 1410, 431, 217, 65, 713,…
## $ search_string_wos <chr> "Acanthocephalus anguill…
## $ wo_shits_as_of_2_6_2017 <dbl> 57, 23, 13, 0, 4, 2, 39,…
## $ notes <chr> NA, NA, "H: dog", "H: pr…
## $ citation <chr> "Kennedy and Moriarty 19…
## $ print_ref <chr> NA, NA, NA, NA, NA, NA, …
## $ who_by <chr> "VR", "VR", "VR", "VR", …
## $ date_entry <date> 2017-01-30, 2017-01-30,…
## $ xc_citation <lgl> NA, NA, NA, NA, NA, NA, …
## $ pgf_citation <chr> NA, NA, NA, NA, NA, NA, …
## $ pgf_more_citations <chr> NA, NA, NA, NA, NA, NA, …
## $ nematode <dbl> 0, 0, 0, 1, 1, 1, 1, 1, …
## $ species <chr> NA, NA, NA, NA, NA, NA, …
## $ disease <chr> NA, NA, NA, NA, NA, NA, …
## $ close <dbl> NA, NA, NA, NA, NA, NA, …
## $ nonclose <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector <dbl> NA, NA, NA, NA, NA, NA, …
## $ intermediate <dbl> NA, NA, NA, NA, NA, NA, …
## $ country <chr> NA, NA, NA, NA, NA, NA, …
## $ DOI <chr> NA, NA, NA, NA, NA, NA, …
## $ evidence <lgl> NA, NA, NA, NA, NA, NA, …
## $ evidence_notes <lgl> NA, NA, NA, NA, NA, NA, …
## $ source <lgl> NA, NA, NA, NA, NA, NA, …
## $ checked_by <lgl> NA, NA, NA, NA, NA, NA, …
## $ variable <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_transmission <dbl> NA, NA, NA, NA, NA, NA, …
## $ zoonotic <dbl> NA, NA, NA, NA, NA, NA, …
## $ type <chr> NA, NA, NA, NA, NA, NA, …
## $ parasite_protozoa <dbl> NA, NA, NA, NA, NA, NA, …
## $ bacterium <dbl> NA, NA, NA, NA, NA, NA, …
## $ fungi <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_rna <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_dna <dbl> NA, NA, NA, NA, NA, NA, …
## $ parasite_other <dbl> NA, NA, NA, NA, NA, NA, …
## $ disease_code <dbl> NA, NA, NA, NA, NA, NA, …
## $ incubation <chr> NA, NA, NA, NA, NA, NA, …
## $ found_worldwide <dbl> NA, NA, NA, NA, NA, NA, …
## $ virus_family <chr> NA, NA, NA, NA, NA, NA, …
## $ virus_genus <chr> NA, NA, NA, NA, NA, NA, …
## $ vehicle_eaten_insect_mite_copepod <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_none <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_respiratory_or_pharyngeal_acquisition <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_water <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_direct_physical_contact <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_shellfish <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_trauma <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_dairy_products <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_meat_or_poultry <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fecal_oral_human <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fly <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_food <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_sexual_contact <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_vegetable_or_fruit <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_secretion_blood_or_tissue <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_amphibian_or_reptile <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_snail_earthworm_or_slug <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_animal_bite <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_droplet_dust_or_aerosol <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_fish <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_soil_or_vegetable_matter <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_unknown <dbl> NA, NA, NA, NA, NA, NA, …
## $ vehicle_breastfeeding <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_none <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_tick <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_fly <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_flea <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_louse <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_mite <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_mosquito <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_sandfly <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_unknown <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_midge <dbl> NA, NA, NA, NA, NA, NA, …
## $ vector_bug <dbl> NA, NA, NA, NA, NA, NA, …
df_sub %>%
filter(!is.na(xc_zoo_score) & !is.na(zoonotic))%>%
count(xc_zoo_score, zoonotic) %>%
ggplot(mapping = aes(x=as.factor(xc_zoo_score), y=as.factor(zoonotic)))+
geom_tile(mapping = aes(fill = n),
color = 'black')+
geom_label(mapping = aes(label = n,
color = n > median(n)),
size = 2.5)+
facet_wrap(~zoonotic)+
scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
"FALSE" = '#091F40'))+
scale_fill_continuous()+
theme_bw()+
theme(axis.text.x = element_text(angle = 20, hjust=1))
df_sub %>%
filter(!is.na(xc_zoo_score) & !is.na(close))%>%
count(xc_zoo_score, virus_rna, close) %>%
ggplot(mapping = aes(x=as.factor(virus_rna), y= as.factor(xc_zoo_score)))+
geom_tile(mapping = aes(fill = n),
color = 'black')+
geom_label(mapping = aes(label = n,
color = n > median(n)),
size = 2.5)+
facet_wrap(~close)+
scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
"FALSE" = '#091F40'))+
scale_fill_continuous()+
theme_bw()+
theme(axis.text.x = element_text(angle = 20, hjust=1))
To be continued.
This project has been funded with Federal funds from the National Library of Medicine (NLM), National Institutes of Health (NIH), under cooperative agreement number UG4LM01234 with the University of Massachusetts Chan Medical School, Lamar Soutter Library. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.